Forward Automatic Differentiation (AD) is a technique for augmenting programs
to both perform their original calculation and also compute its directional
derivative. The essence of Forward AD is to attach a derivative value to each number,
and propagate these through the computation. When derivatives are nested, the
distinct derivative calculations, and their associated attached values, must be
distinguished. In dynamic languages this is typically accomplished by creating a unique
tag for each application of the derivative operator, tagging the attached values, and
overloading the arithmetic operators. We exhibit a subtle bug, present in fielded
implementations, in which perturbations are confused despite the tagging machinery.
1 Forward AD using Tagged Tangents



Forward AD (Wengert, 1964) computes the derivative of a function $f : \mathbb{R} \to \alpha$
at a point $c$ by evaluating $f(c + \varepsilon)$ under a nonstandard interpretation that
associates a conceptually infinitesimal perturbation with each real number, propagates these
augmented values according to the rules of calculus (Leibniz, 1664), and extracts the
perturbation of the result. When $x$ is a number, we use $x + \acute{x}\varepsilon$ to denote a
tangent-vector bundle: the primal value $x$ bundled with the tangent value $\acute{x}$, where
$\acute{x}$ has the same type as $x$. We consider this tangent-vector bundle to also be a
number, with arithmetic defined by regarding it as a truncated power series, or equivalently,
by taking $\varepsilon^2 = 0$ but $\varepsilon \neq 0$. This implies that
$f(x + \acute{x}\varepsilon) = f(x) + \acute{x} f'(x)\varepsilon$ where $f'(x)$ is the first
derivative of $f$ at $x$ (Newton, 1704).
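The truncated power series arithmetic described above can be sketched in a few lines of
Python. This is only an illustration for a single, untagged perturbation: the class name
`Dual`, the helper `lift`, and the sample function `f` are assumptions made for this example,
and only addition and multiplication are overloaded.

```python
class Dual:
    """A number primal + tangent * eps, with eps ** 2 == 0 (illustrative name)."""

    def __init__(self, primal, tangent=0.0):
        self.primal = primal
        self.tangent = tangent

    def __add__(self, other):
        other = lift(other)
        return Dual(self.primal + other.primal, self.tangent + other.tangent)

    def __mul__(self, other):
        other = lift(other)
        # (a + b eps)(c + d eps) = ac + (ad + bc) eps, because eps ** 2 == 0
        return Dual(self.primal * other.primal,
                    self.primal * other.tangent + self.tangent * other.primal)

    # Both operations are commutative, so the reflected versions are the same.
    __radd__ = __add__
    __rmul__ = __mul__


def lift(v):
    """Treat an ordinary number as a bundle with zero tangent."""
    return v if isinstance(v, Dual) else Dual(v, 0.0)


def f(x):
    return x * x * x + 2.0 * x      # f'(x) = 3 x^2 + 2


unit = f(Dual(3.0, 1.0))            # f(3) + f'(3) eps
scaled = f(Dual(3.0, 2.0))          # f(3) + 2 f'(3) eps
print(unit.primal, unit.tangent, scaled.tangent)
```

Evaluating `f` on a bundle with tangent 2 doubles the tangent of the result, which is the
factor $\acute{x}$ in $f(x) + \acute{x} f'(x)\varepsilon$.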
We can define a first-derivative operator $\mathcal{D}$ (footnote 1) by

\[ \mathcal{D}\, f\, x = \operatorname{tg}\, \varepsilon\, (f\, (x + \varepsilon)) \quad \text{where } \varepsilon \text{ is fresh} \tag{1} \]



In order for $\mathcal{D}$ to nest correctly we must distinguish between different sets of
tangent spaces introduced by different invocations of $\mathcal{D}$ (Lavendhomme, 1996), which
can be implemented by tagging (Siskind and Pearlmutter, 2005, 2008). We will indicate different
tags by different subscripts on $\varepsilon$. The tangent extraction function
$\operatorname{tg}$ extracts the tangent part of a tangent-vector bundle, with the appropriate
tag indicated in the first argument.



The tangent part of a numeric tangent-vector bundle is: 

\[ \operatorname{tg}\, \varepsilon\, (a + b\varepsilon) = b \tag{2} \]



When the primal part of the tangent-vector bundle is a function, the tangent-vector 
bundle is itself a function and tg is defined by post-composition: 



\[ \operatorname{tg}\, \varepsilon\, (\lambda x .\, e) = \lambda x .\, \operatorname{tg}\, \varepsilon\, e \tag{3} \]



This is the technique used to implement Forward AD in dynamic languages: arithmetic 
operators are overloaded to handle the chosen representation of numeric tangent-vector 
bundles, with tags generated using "gensym" or an analogous mechanism. 
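A minimal Python sketch of this technique follows. It is an illustration under stated
assumptions rather than the code of any fielded system: the names `Bundle`, `tag_of`, `split`,
`tg` and `D` are invented here, a global counter stands in for "gensym", only `+` and `*` are
overloaded, and nested bundles are kept with the largest tag outermost.

```python
import itertools

_tags = itertools.count()           # stand-in for gensym: each call yields a new tag


class Bundle:
    """primal + tangent * eps_tag, with eps_tag ** 2 == 0 (illustrative name)."""

    def __init__(self, tag, primal, tangent):
        self.tag, self.primal, self.tangent = tag, primal, tangent

    def __add__(self, other):
        t = max(self.tag, tag_of(other))
        (pa, ta), (pb, tb) = split(self, t), split(other, t)
        return Bundle(t, pa + pb, ta + tb)

    def __mul__(self, other):
        t = max(self.tag, tag_of(other))
        (pa, ta), (pb, tb) = split(self, t), split(other, t)
        return Bundle(t, pa * pb, pa * tb + ta * pb)

    __radd__, __rmul__ = __add__, __mul__


def tag_of(v):
    return v.tag if isinstance(v, Bundle) else -1


def split(v, t):
    """Primal and tangent of v with respect to tag t."""
    return (v.primal, v.tangent) if tag_of(v) == t else (v, 0.0)


def tg(tag, v):
    if callable(v):                 # equation (3): post-composition
        return lambda *args: tg(tag, v(*args))
    if isinstance(v, Bundle):
        if v.tag == tag:            # equation (2)
            return v.tangent
        return Bundle(v.tag, tg(tag, v.primal), tg(tag, v.tangent))
    return 0.0                      # an ordinary number carries no perturbation


def D(f):
    def derivative(x):
        eps = next(_tags)           # equation (1): eps is fresh
        return tg(eps, f(x + Bundle(eps, 0.0, 1.0)))
    return derivative


def cube(x):
    return x * x * x


first = D(cube)(3.0)                                        # 3 * 3^2
second = D(D(cube))(3.0)                                    # 6 * 3
nested = D(lambda x: x * D(lambda y: x * y)(2.0))(1.0)      # d/dx (x * x) at x = 1
curried = D(lambda u: lambda x: u * x)(0.0)(5.0)            # tg applied to a function
print(first, second, nested, curried)
```

The `nested` example is the classic case that requires distinct tags: the inner derivative must
treat the outer perturbation of `x` as a constant with respect to its own tag.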



2 A Bug 

If we have properly defined $\mathcal{D}$ and $\operatorname{tg}$, then we can reasonably
expect to use them to calculate correct derivatives in commonly occurring mathematical
situations. In particular, if we define an offset operator:

\begin{align}
s &: \mathbb{R} \to (\mathbb{R} \to \alpha) \to (\mathbb{R} \to \alpha) \notag \\
s\, u\, f\, x &= f\, (x + u) \tag{4}
\end{align}

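the derivative of $s$ at zero should be the same as the derivative operator: if we define

\[ \hat{\mathcal{D}} = \mathcal{D}\, s\, 0 \tag{5} \]

then $\hat{\mathcal{D}} = \mathcal{D}$ should hold, since

\begin{align}
\mathcal{D}\, f\, y &= \operatorname{tg}\, \varepsilon\, (f\, (y + \varepsilon))
  = \operatorname{tg}\, \varepsilon\, (f(y) + f'(y)\varepsilon) = f'(y) \tag{6a} \\
\hat{\mathcal{D}}\, f\, y &= \mathcal{D}\, s\, 0\, f\, y
  = \left. \frac{d}{du}\, s\, u\, f\, y \right|_{u=0}
  = \left. \frac{d}{du}\, f(y + u) \right|_{u=0} = f'(y) \tag{6b}
\end{align}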
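As a concrete check of (6a) and (6b), take the illustrative choice $f(x) = x^3$ (this example
is ours and is not part of the original derivation). Both routes give $3y^2$:

\begin{align}
\mathcal{D}\, f\, y &= \operatorname{tg}\, \varepsilon\, ((y + \varepsilon)^3)
  = \operatorname{tg}\, \varepsilon\, (y^3 + 3y^2\varepsilon + 3y\varepsilon^2 + \varepsilon^3)
  = \operatorname{tg}\, \varepsilon\, (y^3 + 3y^2\varepsilon) = 3y^2 \notag \\
\hat{\mathcal{D}}\, f\, y &= \left. \frac{d}{du}\, (y + u)^3 \right|_{u=0}
  = \left. 3(y + u)^2 \right|_{u=0} = 3y^2 \notag
\end{align}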

1 The type signature would be $\mathcal{D} : (\mathbb{R} \to \alpha) \to \mathbb{R} \to \alpha'$
where $\alpha'$ is the tangent space of $\alpha$. It is natural to equate
$\mathbb{R}' = \mathbb{R}$, and because we only consider $\mathbb{R}$ and functions built on
$\mathbb{R}$, and we equate $(\alpha \to \beta)' = \alpha \to \beta'$, and it follows from
Church encoding that $(\alpha \times \beta)' = \alpha' \times \beta'$, we can in all present
examples equate $\alpha' = \alpha$. A full treatment of this topic is beyond our present scope.
Unfortunately, as we shall see, the above can exhibit a subtle bug:

\[ \hat{\mathcal{D}}\, (\hat{\mathcal{D}}\, f)\, x = 0 \neq \mathcal{D}\, (\mathcal{D}\, f)\, x = f''(x) \tag{7} \]

This is not an artificial example. It is quite natural to construct an x-axis differential 
operator and apply it to a two-dimensional function twice, along the x and then y axis 
directions, by applying the operator, a rotation, and the operator again, thus creating 
precisely this sort of cascaded use of a defined differential operator. 

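Note that

\begin{align}
\hat{\mathcal{D}} = \mathcal{D}\, s\, 0 &= \operatorname{tg}\, \varepsilon\, (s\, (0 + \varepsilon)) \notag \\
 &= \operatorname{tg}\, \varepsilon\, (\lambda f .\, \lambda x .\, f\, (x + \varepsilon)) \notag \\
 &= \lambda f .\, \lambda x .\, \operatorname{tg}\, \varepsilon\, (f\, (x + \varepsilon)) \tag{8}
\end{align}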
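Assuming that $g : \mathbb{R} \to \mathbb{R}$ we can substitute using (8) and then reduce:

\begin{align}
\hat{\mathcal{D}}\, (\hat{\mathcal{D}}\, g)\, y
 &= (\lambda f .\, \lambda x .\, \operatorname{tg}\, \varepsilon\, (f\, (x + \varepsilon)))\;
    ((\lambda f .\, \lambda x .\, \operatorname{tg}\, \varepsilon\, (f\, (x + \varepsilon)))\; g)\; y \tag{9a} \\
 &= (\lambda f .\, \lambda x .\, \operatorname{tg}\, \varepsilon\, (f\, (x + \varepsilon)))\;
    (\lambda x .\, \operatorname{tg}\, \varepsilon\, (g\, (x + \varepsilon)))\; y \tag{9b} \\
 &= (\lambda x .\, \operatorname{tg}\, \varepsilon\, ((\lambda x .\, \operatorname{tg}\, \varepsilon\, (g\, (x + \varepsilon)))\, (x + \varepsilon)))\; y \tag{9c} \\
 &= \operatorname{tg}\, \varepsilon\, ((\lambda x .\, \operatorname{tg}\, \varepsilon\, (g\, (x + \varepsilon)))\, (y + \varepsilon)) \tag{9d} \\
 &= \operatorname{tg}\, \varepsilon\, (\operatorname{tg}\, \varepsilon\, (g\, ((y + \varepsilon) + \varepsilon))) \tag{9e} \\
 &= \operatorname{tg}\, \varepsilon\, (\operatorname{tg}\, \varepsilon\, (g\, (y + 2\varepsilon))) \tag{9f} \\
 &= \operatorname{tg}\, \varepsilon\, (\operatorname{tg}\, \varepsilon\, (g(y) + 2g'(y)\varepsilon)) \tag{9g} \\
 &= \operatorname{tg}\, \varepsilon\, (2g'(y)) \tag{9h} \\
 &= 0 \tag{9i}
\end{align}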
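The failure in (9a) through (9i) can be reproduced in a few lines of Python. The sketch below
is an illustration under assumptions, not the code of any fielded implementation: the names
`Bundle`, `tag_of`, `split`, `tg`, `D`, `s` and `cube` are invented for this example, a global
counter stands in for "gensym", and only `+` and `*` are overloaded.

```python
import itertools

_tags = itertools.count()           # stand-in for gensym


class Bundle:
    """primal + tangent * eps_tag, with eps_tag ** 2 == 0 (illustrative name)."""

    def __init__(self, tag, primal, tangent):
        self.tag, self.primal, self.tangent = tag, primal, tangent

    def __add__(self, other):
        t = max(self.tag, tag_of(other))
        (pa, ta), (pb, tb) = split(self, t), split(other, t)
        return Bundle(t, pa + pb, ta + tb)

    def __mul__(self, other):
        t = max(self.tag, tag_of(other))
        (pa, ta), (pb, tb) = split(self, t), split(other, t)
        return Bundle(t, pa * pb, pa * tb + ta * pb)

    __radd__, __rmul__ = __add__, __mul__


def tag_of(v):
    return v.tag if isinstance(v, Bundle) else -1


def split(v, t):
    return (v.primal, v.tangent) if tag_of(v) == t else (v, 0.0)


def tg(tag, v):
    if callable(v):                 # equation (3)
        return lambda *args: tg(tag, v(*args))
    if isinstance(v, Bundle):
        if v.tag == tag:            # equation (2)
            return v.tangent
        return Bundle(v.tag, tg(tag, v.primal), tg(tag, v.tangent))
    return 0.0


def D(f):
    def derivative(x):
        eps = next(_tags)           # equation (1): a fresh tag per application
        return tg(eps, f(x + Bundle(eps, 0.0, 1.0)))
    return derivative


def s(u):                           # equation (4), curried
    return lambda f: lambda x: f(x + u)


def cube(x):
    return x * x * x                # g(y) = y^3, so g''(3) = 18


D_hat = D(s)(0.0)                   # equation (5): the tag is generated once, here

first = D_hat(cube)(3.0)            # agrees with D(cube)(3.0)
buggy = D_hat(D_hat(cube))(3.0)     # both levels share one tag, as in (9e)
correct = D(D(cube))(3.0)           # each application of D draws its own tag
print(first, buggy, correct)
```

The single application `D_hat(cube)` behaves correctly, which is why the problem is easy to
miss; only the cascaded use exposes the shared tag.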

This went wrong, yielding $0$ instead of $g''(y)$, because the tag $\varepsilon$ was generated
exactly once, when the definition of $\hat{\mathcal{D}}$ was reduced to normal form in (8). The
instantiation of $\mathcal{D}$ is the point at which a fresh tag is introduced; early
instantiation can result in reuse of the same tag in logically distinct derivative
calculations. Here, the first derivative and the second derivative become confused at (9f). We
have two nested applications of $\operatorname{tg}$ for $\varepsilon$, but for correctness
these should be distinctly tagged: $\varepsilon_1$ vs $\varepsilon_2$. If $\hat{\mathcal{D}}$
were not already reduced to normal form, and we instead substitute its definition, then
$\mathcal{D}$ will be instantiated twice, giving two fresh tags and a correct result:



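\begin{align}
\hat{\mathcal{D}}\, (\hat{\mathcal{D}}\, g)\, y
 &= \mathcal{D}\, s\, 0\, (\mathcal{D}\, s\, 0\, g)\, y \tag{10a} \\
 &= (\lambda f .\, \lambda x .\, \operatorname{tg}\, \varepsilon_1\, (f\, (x + \varepsilon_1)))\;
    ((\lambda f .\, \lambda x .\, \operatorname{tg}\, \varepsilon_2\, (f\, (x + \varepsilon_2)))\; g)\; y \tag{10b} \\
 &= (\lambda f .\, \lambda x .\, \operatorname{tg}\, \varepsilon_1\, (f\, (x + \varepsilon_1)))\;
    (\lambda x .\, \operatorname{tg}\, \varepsilon_2\, (g\, (x + \varepsilon_2)))\; y \tag{10c} \\
 &= (\lambda x .\, \operatorname{tg}\, \varepsilon_1\, ((\lambda x .\, \operatorname{tg}\, \varepsilon_2\, (g\, (x + \varepsilon_2)))\, (x + \varepsilon_1)))\; y \tag{10d} \\
 &= \operatorname{tg}\, \varepsilon_1\, ((\lambda x .\, \operatorname{tg}\, \varepsilon_2\, (g\, (x + \varepsilon_2)))\, (y + \varepsilon_1)) \tag{10e} \\
 &= \operatorname{tg}\, \varepsilon_1\, (\operatorname{tg}\, \varepsilon_2\, (g\, ((y + \varepsilon_1) + \varepsilon_2))) \tag{10f} \\
 &= \operatorname{tg}\, \varepsilon_1\, (\operatorname{tg}\, \varepsilon_2\, (g(y + \varepsilon_1) + g'(y + \varepsilon_1)\varepsilon_2)) \tag{10g} \\
 &= \operatorname{tg}\, \varepsilon_1\, g'(y + \varepsilon_1) \tag{10h} \\
 &= \operatorname{tg}\, \varepsilon_1\, (g'(y) + g''(y)\varepsilon_1) \tag{10i} \\
 &= g''(y) \tag{10j}
\end{align}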



3 Discussion 

In a Forward AD system which uses tags to distinguish instances of $\mathcal{D}$, eta reduction
is unsound. The definition $\hat{\mathcal{D}}\, f\, y = \mathcal{D}\, s\, 0\, f\, y$ must not be
eta reduced to $\hat{\mathcal{D}} = \mathcal{D}\, s\, 0$, and one must not memoize or hoist
$\mathcal{D}\, s\, 0$, as it is impure due to the requirement for a fresh tag. Even the above
constraint can be insufficient when $\hat{\mathcal{D}}$ is applied to a function that is not
$\mathbb{R} \to \mathbb{R}$ but instead $\mathbb{R} \to \alpha$ for some other $\alpha$. In
fact, expanded variants of $\hat{\mathcal{D}}$ are needed for various $\alpha$. For instance,
applying $\hat{\mathcal{D}}$ to a function $\mathbb{R} \to \cdots \to \mathbb{R} \to \mathbb{R}$
requires an eta-reduction-protected

\[ \hat{\mathcal{D}}_n\, f\, y_1 \cdots y_n = \mathcal{D}\, s\, 0\, f\, y_1 \cdots y_n \tag{11} \]

In general, $\mathcal{D}$ should only be instantiated in a context that contains all arguments
necessary to subsequently allow the post-composition of the $\operatorname{tg}$ introduced by
the instantiation of $\mathcal{D}$ to immediately beta reduce to a non-function-containing
value. Note that $\operatorname{tg}$ distributes over aggregates like tuples and lists, further
complicating the determination of when $\mathcal{D}$ can be instantiated.
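The eta-reduction-protected definitions can be sketched in Python as follows. This is a
hedged illustration only: `Bundle`, `tag_of`, `split`, `tg`, `D`, `s`, `D_hat` and `D_hat_n`
are names invented here, a global counter stands in for "gensym", and only `+` and `*` are
overloaded. The point is that `D(s)(0.0)` is evaluated inside the body, so a fresh tag is drawn
each time all arguments are present.

```python
import itertools

_tags = itertools.count()           # stand-in for gensym


class Bundle:
    """primal + tangent * eps_tag, with eps_tag ** 2 == 0 (illustrative name)."""

    def __init__(self, tag, primal, tangent):
        self.tag, self.primal, self.tangent = tag, primal, tangent

    def __add__(self, other):
        t = max(self.tag, tag_of(other))
        (pa, ta), (pb, tb) = split(self, t), split(other, t)
        return Bundle(t, pa + pb, ta + tb)

    def __mul__(self, other):
        t = max(self.tag, tag_of(other))
        (pa, ta), (pb, tb) = split(self, t), split(other, t)
        return Bundle(t, pa * pb, pa * tb + ta * pb)

    __radd__, __rmul__ = __add__, __mul__


def tag_of(v):
    return v.tag if isinstance(v, Bundle) else -1


def split(v, t):
    return (v.primal, v.tangent) if tag_of(v) == t else (v, 0.0)


def tg(tag, v):
    if callable(v):
        return lambda *args: tg(tag, v(*args))
    if isinstance(v, Bundle):
        if v.tag == tag:
            return v.tangent
        return Bundle(v.tag, tg(tag, v.primal), tg(tag, v.tangent))
    return 0.0


def D(f):
    def derivative(x):
        eps = next(_tags)
        return tg(eps, f(x + Bundle(eps, 0.0, 1.0)))
    return derivative


def s(u):
    return lambda f: lambda x: f(x + u)


def D_hat(f):
    # Not eta reduced: D(s)(0.0) is re-instantiated on every call with y.
    return lambda y: D(s)(0.0)(f)(y)


def D_hat_n(f, *ys):
    # Equation (11): instantiate only once all n curried arguments are available.
    out = D(s)(0.0)(f)
    for y in ys:
        out = out(y)
    return out


def cube(x):
    return x * x * x


second = D_hat(D_hat(cube))(3.0)                            # g''(3) = 18
partial = D_hat_n(lambda x: lambda z: x * x * z, 3.0, 5.0)  # d/dx (x^2 z) = 2 x z
print(second, partial)
```

This only addresses the cases exercised here; as the text notes, deciding in general when all
necessary arguments are present is harder, for instance when results are tuples or lists.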

Another alternative would be to guard the returned function object against tag collision. 
In a programming language with opaque closures, post-composition must be implemented 
using a wrapper: 

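\[ \operatorname{tg}\, \varepsilon\, (\lambda x .\, e) = \lambda y .\, \operatorname{tg}\, \varepsilon\, ((\lambda x .\, e)\; y) \tag{12} \]

This wrapper can be augmented to guard against the problem we have encountered:

\[ \operatorname{tg}\, \varepsilon_1\, (\lambda x .\, e) = \lambda y .\, \operatorname{swiz}\, \varepsilon_2\, \varepsilon_1\, (\operatorname{tg}\, \varepsilon_1\, ((\lambda x .\, e)\, (\operatorname{swiz}\, \varepsilon_1\, \varepsilon_2\; y))) \quad \text{where } \varepsilon_2 \text{ is fresh} \tag{13} \]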
Here "swiz $\varepsilon_1$ $\varepsilon_2$ $v$" substitutes $\varepsilon_2$ for every
occurrence of $\varepsilon_1$ in $v$. In a language with opaque closures, swiz must operate on
function objects by appropriate pre- and post-composition. This technique was used to address
the present issue in the 30-Aug-2011 release of scmutils, a software package that accompanies
a textbook on classical mechanics (Sussman et al., 2001), in response to an early version of
this manuscript. Unfortunately the computational burden of such "swizzling" violates the
complexity guarantees of Forward AD. This leaves us in the awkward position of there being no
known technique for implementing Forward AD, with its defining complexity guarantee, and
generalized to functions with higher-order outputs (including even curried functions), in a
dynamic language.
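A deliberately simplified Python sketch of the guarded wrapper (13) is given below. It is not
the scmutils code and makes a loud simplifying assumption: `swiz` here renames tags only inside
numeric values and passes function objects through unchanged, whereas the text above requires
pre- and post-composition for functions. All names (`Bundle`, `tag_of`, `split`, `swiz`, `tg`,
`D`, `s`, `cube`) are invented for this example. Under these assumptions the hoisted operator
of (5) yields the correct second derivative for the example of Section 2.

```python
import itertools

_tags = itertools.count()           # stand-in for gensym


class Bundle:
    """primal + tangent * eps_tag, with eps_tag ** 2 == 0 (illustrative name)."""

    def __init__(self, tag, primal, tangent):
        self.tag, self.primal, self.tangent = tag, primal, tangent

    def __add__(self, other):
        t = max(self.tag, tag_of(other))
        (pa, ta), (pb, tb) = split(self, t), split(other, t)
        return Bundle(t, pa + pb, ta + tb)

    def __mul__(self, other):
        t = max(self.tag, tag_of(other))
        (pa, ta), (pb, tb) = split(self, t), split(other, t)
        return Bundle(t, pa * pb, pa * tb + ta * pb)

    __radd__, __rmul__ = __add__, __mul__


def tag_of(v):
    return v.tag if isinstance(v, Bundle) else -1


def split(v, t):
    return (v.primal, v.tangent) if tag_of(v) == t else (v, 0.0)


def swiz(old, new, v):
    """Substitute tag `new` for tag `old` in a numeric value.

    Assumption of this sketch: function objects are returned unchanged.
    """
    if isinstance(v, Bundle):
        p, t = swiz(old, new, v.primal), swiz(old, new, v.tangent)
        tag = new if v.tag == old else v.tag
        # Rebuild through arithmetic so the largest tag stays outermost.
        return p + t * Bundle(tag, 0.0, 1.0)
    return v


def tg(tag, v):
    if callable(v):                 # equation (13): guarded post-composition
        def guarded(y):
            fresh = next(_tags)
            out = tg(tag, v(swiz(tag, fresh, y)))
            return swiz(fresh, tag, out)
        return guarded
    if isinstance(v, Bundle):
        if v.tag == tag:
            return v.tangent
        return Bundle(v.tag, tg(tag, v.primal), tg(tag, v.tangent))
    return 0.0


def D(f):
    def derivative(x):
        eps = next(_tags)
        return tg(eps, f(x + Bundle(eps, 0.0, 1.0)))
    return derivative


def s(u):
    return lambda f: lambda x: f(x + u)


def cube(x):
    return x * x * x


D_hat = D(s)(0.0)                   # hoisted, exactly as in the buggy case
guarded_second = D_hat(D_hat(cube))(3.0)
print(guarded_second)
```

Each call through the wrapper traverses its argument and its result to rename tags, which
gives a feel for the extra work behind the complexity concern raised above.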

We have used fresh tags to implement a form of dependent typing, where a fresh set of 
tangent spaces is created each time T> is instantiated. Forward AD implementations in 
dynamically typed languages which support operator overloading (e.g., Scheme, Python) 
are susceptible to the problem we have exhibited due to the impurity of "gensym." It 
seems reasonable to speculate that static type systems (particularly those with at least 
some limited form of dependent typing such as existential types) may prevent this error. 
However, (a) current type systems prevent first-class automatic differentiation operators 
themselves from being defined, and (b) an intuition is a far cry from a proof. It is a current
topic of research to satisfactorily define a λ-calculus based system which correctly models
Forward AD (Ehrhard and Regnier, 2003; Manzyuk, 2012a,b).



Acknowledgments 

This work was supported, in part, by Science Foundation Ireland Principal Investigator
grant 09/IN.1/I2637 and Army Research Laboratory Cooperative Agreement Number
W911NF-10-2-0060. The views and conclusions contained in this document are those of
the authors and should not be interpreted as representing the official policies, either
express or implied, of SFI, ARL, or the Irish or U.S. Governments. The U.S. Government is
authorized to reproduce and distribute reprints for Government purposes, notwithstanding
any copyright notation herein.



References 

Thomas Ehrhard and Laurent Regnier. The differential lambda-calculus. Theoretical 
Computer Science, 309 (1-3): 1-41, December 2003. 

René Lavendhomme. Basic Concepts of Synthetic Differential Geometry. Kluwer Academic,
1996.

Gottfried Wilhelm Leibniz. A new method for maxima and minima as well as tangents, 
which is impeded neither by fractional nor irrational quantities, and a remarkable type 
of calculus for this. Acta Eruditorum, 1664. 

Oleksandr Manzyuk. A simply typed λ-calculus of forward automatic differentiation.
In Mathematical Foundations of Programming Semantics Twenty-eighth Annual Conference,
pages 259-273, Bath, UK, June 6-9 2012a. URL
http://dauns.math.tulane.edu/~mfps/mfps28proc.pdf

Oleksandr Manzyuk. Tangent bundles in differential λ-categories. Technical Report
1202.0411, arXiv, 2012b. URL http://arxiv.org/abs/1202.0411



Isaac Newton. De quadratura curvarum, 1704. In Opticks, 1704 edition. Appendix.

Jeffrey Mark Siskind and Barak A. Pearlmutter. Perturbation confusion and referential 
transparency: Correct functional implementation of forward-mode AD. In Andrew 
Butterfield, editor, Implementation and Application of Functional Languages — 17th 
International Workshop, IFL'05, pages 1-9, Dublin, Ireland, September 19-21 2005. 
Trinity College Dublin Computer Science Department Technical Report TCD-CS-2005-60.

Jeffrey Mark Siskind and Barak A. Pearlmutter. Nesting forward-mode AD in a functional
framework. Higher-Order and Symbolic Computation, 21(4):361-76, 2008.
doi: 10.1007/s10990-008-9037-1.

Gerald Jay Sussman, Jack Wisdom, and Meinhard E. Mayer. Structure and Interpretation 
of Classical Mechanics. MIT Press, Cambridge, MA, 2001. 

Robert Edwin Wengert. A simple automatic derivative evaluation program. Comm. of 
the ACM, 7(8):463-4, 1964. 






